Personality Clustering

This notebook is merely for practice; the objective is to try to group the participants into some form of cluster given their answers to a personality test. You can get the dataset from Kaggle here.
From the dataset "codebook" we can read:

This data was collected (2016-2018) through an interactive on-line personality test. The personality test was constructed with the "Big-Five Factor Markers" from the IPIP. Participants were informed that their responses would be recorded and used for research at the beginning of the test, and asked to confirm their consent at the end of the test.

The following items were presented on one page and each was rated on a five point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.

Feature Description
EXT1 I am the life of the party.
EXT2 I don't talk a lot.
EXT3 I feel comfortable around people.
EXT4 I keep in the background.
EXT5 I start conversations.
EXT6 I have little to say.
EXT7 I talk to a lot of different people at parties.
EXT8 I don't like to draw attention to myself.
EXT9 I don't mind being the center of attention.
EXT10 I am quiet around strangers.
EST1 I get stressed out easily.
EST2 I am relaxed most of the time.
EST3 I worry about things.
EST4 I seldom feel blue.
EST5 I am easily disturbed.
EST6 I get upset easily.
EST7 I change my mood a lot.
EST8 I have frequent mood swings.
EST9 I get irritated easily.
EST10 I often feel blue.
AGR1 I feel little concern for others.
AGR2 I am interested in people.
AGR3 I insult people.
AGR4 I sympathize with others' feelings.
AGR5 I am not interested in other people's problems.
AGR6 I have a soft heart.
AGR7 I am not really interested in others.
AGR8 I take time out for others.
AGR9 I feel others' emotions.
AGR10 I make people feel at ease.
CSN1 I am always prepared.
CSN2 I leave my belongings around.
CSN3 I pay attention to details.
CSN4 I make a mess of things.
CSN5 I get chores done right away.
CSN6 I often forget to put things back in their proper place.
CSN7 I like order.
CSN8 I shirk my duties.
CSN9 I follow a schedule.
CSN10 I am exacting in my work.
OPN1 I have a rich vocabulary.
OPN2 I have difficulty understanding abstract ideas.
OPN3 I have a vivid imagination.
OPN4 I am not interested in abstract ideas.
OPN5 I have excellent ideas.
OPN6 I do not have a good imagination.
OPN7 I am quick to understand things.
OPN8 I use difficult words.
OPN9 I spend time reflecting on things.
OPN10 I am full of ideas.

The time spent on each question is also recorded in milliseconds. These are the variables ending in _E. This was calculated by taking the time when the button for the question was clicked minus the time of the most recent other button click.

Feature Description
dateload The timestamp when the survey was started.
screenw The width of the user's screen in pixels
screenh The height of the user's screen in pixels
introelapse The time in seconds spent on the landing / intro page
testelapse The time in seconds spent on the page with the survey questions
endelapse The time in seconds spent on the finalization page (where the user was asked to indicate if they had answered accurately and whether their answers could be stored and used for research. Again: this dataset only includes users who answered "Yes" to this question; users were free to answer no and could still view their results either way)
IPC The number of records from the user's IP address in the dataset. For max cleanliness, only use records where this value is 1. High values can be because of shared networks (e.g. entire universities) or multiple submissions
country The country, determined by technical information (NOT ASKED AS A QUESTION)
lat_appx_lots_of_err approximate latitude of user. determined by technical information, THIS IS NOT VERY ACCURATE. Read the article "How an internet mapping glitch turned a random Kansas farm into a digital hell" https://splinternews.com/how-an-internet-mapping-glitch-turned-a-random-kansas-f-1793856052 to learn about the perils of relying on this information
long_appx_lots_of_err approximate longitude of user

The analysis

Let's import some data and take a look at it

we'll set the folder path and seed random processes to get a deterministic environment and allow reproducibility
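A minimal sketch of that setup; the folder name and seed value are placeholders, not the notebook's actual choices:

```python
import os
import random

import numpy as np

DATA_DIR = "data"  # hypothetical folder holding the Kaggle CSV
SEED = 42          # arbitrary seed, fixed for reproducibility

random.seed(SEED)
np.random.seed(SEED)
```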

we'll get the data and follow the advice on the codebook regarding IPC, and GPS info. we'll also delete some other metadata variables that I believe not to be useful
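A sketch of that loading-and-filtering step. The `read_csv` call is commented out because the filename and separator are assumptions about the Kaggle download; a tiny in-memory frame stands in so the filtering logic itself can run:

```python
import pandas as pd

# In the notebook this would be something like (path is hypothetical):
# df = pd.read_csv("data/data-final.csv", sep="\t")
# Mock a few rows so the filtering logic below is self-contained.
df = pd.DataFrame({
    "EXT1": [4, 2, 5],
    "IPC": [1, 3, 1],                          # records per IP address
    "lat_appx_lots_of_err": [39.0, 0.0, 48.8],
    "long_appx_lots_of_err": [-95.7, 0.0, 2.3],
    "dateload": ["2016-03-03", "2017-01-01", "2018-06-06"],
})

# Follow the codebook: keep only single-submission IPs, then drop the
# unreliable GPS columns and other metadata we won't use.
df = df[df["IPC"] == 1]
df = df.drop(columns=["IPC", "lat_appx_lots_of_err",
                      "long_appx_lots_of_err", "dateload"])
```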

let's take a peek

for the categorical variables we see different amounts of skewness in each one, usually left-skewed, as option 0 seems to be rarely used in many of them. On the time variables, we have some big outliers that make the graphs look as though all the data sits at 0. However, it is good that we only have a few such outliers and the data is concentrated at more normal values. People shouldn't have to take 1e8 seconds to answer a question.

we have 1141 rows with nulls that we can drop
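The drop itself is a one-liner; here it runs on a mock frame with a couple of missing answers:

```python
import numpy as np
import pandas as pd

# Mock frame with two rows containing missing answers.
df = pd.DataFrame({"EXT1": [4.0, np.nan, 5.0], "AGR1": [2.0, 3.0, np.nan]})

n_before = len(df)
df = df.dropna()          # drop any row with at least one missing answer
n_dropped = n_before - len(df)
```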

We also have 221 unique countries; however, the last graph just above this cell already lets us know that the US has a disproportionate amount of data, about half the dataset. We could try to balance it out, but that would halve our dataset, so we'll see what comes out of it as-is.
Let's try to visualize it better

from the table we can also see that there are negative time values, which shouldn't be possible. let's clean those out
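One way to sketch that cleanup, assuming the per-question timers end in `_E` as the codebook describes (the mock rows are illustrative):

```python
import pandas as pd

# Tiny mock with two impossible (negative) elapsed times.
df = pd.DataFrame({"EXT1_E": [1200, -50, 800],
                   "testelapse": [300, 250, -1]})

# Per-question timers end in _E; testelapse is another elapsed-time column.
time_cols = [c for c in df.columns if c.endswith("_E")] + ["testelapse"]

# Keep only rows where every time measurement is non-negative.
df = df[(df[time_cols] >= 0).all(axis=1)]
```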

We'll create a couple of datasets so we can run them through our model and see what is convenient

The first dataset will have all the variables; we'll scale everything (even though scaling the categorical features can be kind of pointless, we'll give it a go) and we'll one-hot encode the countries.
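A sketch of that first preparation on a mock frame (column names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "EXT1": [4, 2, 5, 1],
    "EXT1_E": [1200, 900, 4000, 700],
    "country": ["US", "GB", "US", "FR"],
})

# One-hot encode country, then scale every column (including the item
# scores, even if scaling 1-5 Likert codes is of debatable value).
df1 = pd.get_dummies(df, columns=["country"])
df1 = pd.DataFrame(StandardScaler().fit_transform(df1),
                   columns=df1.columns)
```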

The second dataset will also have all the variables and we'll scale everything, but! we'll label-encode country.
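The only change from the first dataset is the country encoding, which might look like:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.DataFrame({"EXT1": [4, 2, 5], "country": ["US", "GB", "US"]})

df2 = df.copy()
# Map each country name to an integer code instead of one-hot columns.
df2["country"] = LabelEncoder().fit_transform(df2["country"])
df2 = pd.DataFrame(StandardScaler().fit_transform(df2),
                   columns=df2.columns)
```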

The third dataset will also have all the variables, but we'll forgo scaling the non-time variables. After all, scaling doesn't make that much sense there since they are effectively label-encoded categorical variables. We'll label-encode country as well, and we'll replace time outliers with the median value before scaling.
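A sketch of that third preparation; the IQR fence used to flag outliers is my assumption, since the notebook doesn't state which rule it applies:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.DataFrame({
    "EXT1": [4, 2, 5, 1],
    "EXT1_E": [1200, 900, 9_000_000, 700],  # one extreme time outlier
    "country": ["US", "GB", "US", "FR"],
})

df3 = df.copy()
df3["country"] = LabelEncoder().fit_transform(df3["country"])

time_cols = [c for c in df3.columns if c.endswith("_E")]
for col in time_cols:
    # Flag outliers with a 1.5*IQR fence (an assumption), then replace
    # them with the column median before scaling.
    q1, q3 = df3[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = ((df3[col] < q1 - 1.5 * iqr) |
                (df3[col] > q3 + 1.5 * iqr))
    df3.loc[outliers, col] = df3[col].median()

# Scale only the time columns; the Likert items keep their 1-5 codes.
df3[time_cols] = StandardScaler().fit_transform(df3[time_cols])
```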

well, that doesn't look great, but I don't know of a good way to handle that many outliers in the time variables. Each one has plenty of high outliers, and removing all of them would eliminate too much of our dataset.

The fourth and last dataset: no scaling, only non-time variables, with countries label-encoded.

The modeling

We'll start modeling now. I wanted to try a bunch of unsupervised clustering models (MeanShift, AffinityPropagation, AgglomerativeClustering, SpectralClustering, DBSCAN, OPTICS, Birch) but sadly my computer does not have the necessary specs to try more than KMeans with this many observations.
We'll check several performance measures: the sum of squared errors (inertia), the Davies-Bouldin index, and the Calinski-Harabasz index.
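The sweep over cluster counts might look like the sketch below; `make_blobs` stands in for the prepared dataset, and the range of `k` values is illustrative:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

# Synthetic stand-in for one of the prepared datasets.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

scores = {}
for k in range(2, 9):
    km = MiniBatchKMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    scores[k] = {
        "sse": km.inertia_,                            # lower is better
        "db": davies_bouldin_score(X, km.labels_),     # lower is better
        "ch": calinski_harabasz_score(X, km.labels_),  # higher is better
    }
```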

Davies-Bouldin Index

This index signifies the average ‘similarity’ between clusters, where the similarity is a measure that compares the distance between clusters with the size of the clusters themselves. Zero is the lowest possible score. Values closer to zero indicate a better partition
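Concretely, with $s_i$ the average distance of points in cluster $i$ to its centroid, $d_{ij}$ the distance between centroids $i$ and $j$, and $k$ clusters, the index averages each cluster's worst-case similarity:

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}}$$

Compact, well-separated clusters make each ratio small, which is why values near zero are better.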

Apparently the fewer the clusters the better?

Calinski-Harabasz Index

A higher Calinski-Harabasz score relates to a model with better defined clusters.
The index is the ratio of the between-cluster dispersion to the within-cluster dispersion, computed over all clusters
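In symbols, with $\mathrm{tr}(B_k)$ the trace of the between-cluster dispersion matrix, $\mathrm{tr}(W_k)$ the trace of the within-cluster dispersion matrix, $n$ samples, and $k$ clusters:

$$CH = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \cdot \frac{n - k}{k - 1}$$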

Some observations from the error metrics

We've used MiniBatchKMeans to get an overall picture, since it trains much faster without losing much precision. However, given the above observations, we'll now use K-means to train on the simple dataframe with 4 clusters, trying to get the maximum amount of precision. Yes, I know it's "the Big 5", so why not use 5 clusters? Well, the data is saying otherwise; who is to say the Big 5 isn't actually the Big 4?
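The final fit is then a plain `KMeans` call; `make_blobs` again stands in for `df_simple`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for df_simple.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

km = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X)
labels = km.labels_   # one cluster assignment per observation
```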

Let's look at the clusters data to see if we can distinguish what separates each cluster from one another
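One simple way to do that comparison is a per-cluster mean of every feature; the frame below is a random stand-in and the column names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Random stand-in for the Likert-item columns.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.integers(1, 6, size=(300, 3)),
                  columns=["EXT1", "AGR1", "CSN5"])

labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(df)

# One row per cluster, one column per item: features whose means differ
# most across rows are the ones that separate the clusters.
profile = df.groupby(labels).mean()
```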

Some observations after looking at the graphs

  1. Features that seem to vary the most across clusters, meaning they might play an important role in defining them, are CSN5 and Country
  2. Cluster 0 sets itself apart from the others the most; it shows small differences in features EXT1, EXT8, EST1, EST4, EST8, AGR4, AGR8, CSN1, CSN3, CSN7, CSN8. Note it does not distinguish itself in any OPN feature. Having that much variance might mean it holds the most observations, meaning it's the biggest and most heterogeneous cluster.
  3. Bar point 1, Cluster 1 shows some differences in features EXT6 and EXT7 only
  4. Bar point 1, Cluster 2 shows some differences in features AGR1, OPN5 and OPN7 only
  5. Bar point 1, Cluster 3 shows some differences in features AGR1 and OPN5 only

Let's check our guess on obs. nº 2

It seems we were right, cluster 0 has over 50% of the data. Might it have US observations?
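Checking that is just a normalized count of the label vector; the labels below are a mock example:

```python
import numpy as np
import pandas as pd

# Mock label vector standing in for the KMeans output.
labels = np.array([0, 0, 0, 1, 2, 0, 3, 0])

# Fraction of observations in each cluster.
sizes = pd.Series(labels).value_counts(normalize=True)
```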

Another observation, Cluster 1 seems perhaps the most evenly distributed.

Let's do some PCA and see if we can draw a nice 2D graph
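A sketch of the projection; the scatter call is left as a comment since the plotting details aren't essential:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the clustered dataset.
X, _ = make_blobs(n_samples=500, centers=4, n_features=8, random_state=42)

pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X)   # (n_samples, 2) projection

# The two components can then be scattered, coloured by cluster label:
# plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=4)
```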

That doesn't look so good though...

What if we removed the country and did this again?

(just with df_simple this time around)

Observations

Okay, this time let's assume the "Big 5" thing works and fit 5 clusters with K-means, and see if we can spot their characteristics

That looks much better! A lot more variance, meaning perhaps each cluster is more uniquely identified

PCA time!

Beautiful!!

Wrap up

There are more things we could do and try, like using neural networks to find the clusters, but I believe this dataset has its limitations, so we'll leave those adventures for more interesting datasets. This was a simple exercise in data analysis and some clustering for practice.
In conclusion, it seems 5 clusters do work best for "the Big 5"; nothing wrong with some confirmatory evidence, right?